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Abstract —Model predictive control (MFC) is an effec¬ 
tive method for controlling robotic systems, particularly au¬ 
tonomous aerial vehicles such as quadcopters. However, ap¬ 
plication of MFC can be computationally demanding, and 
typically requires estimating the state of the system, which 
can be challenging in complex, unstructured environments. 
Reinforcement learning can in principle forego the need for 
explicit state estimation and acquire a policy that directly 
maps sensor readings to actions, but is difficult to apply to 
unstable systems that are liable to fail catastrophically during 
training before an effective policy has been found. We propose 
to combine MFC with reinforcement learning in the framework 
of guided policy search, where MFC is used to generate data 
at training time, under full state observations provided by an 
instrumented training environment. This data is used to train 
a deep neural network policy, which is allowed to access only 
the raw observations from the vehicle’s onboard sensors. After 
training, the neural network policy can successfully control the 
robot without knowledge of the full state, and at a fraction of 
the computational cost of MFC. We evaluate our method by 
learning obstacle avoidance policies for a simulated quadrotor, 
using simulated onboard sensors and no explicit state estimation 
at test time. 

I. Introduction 

Model predictive control (MFC) is an effective and re¬ 
liable method for controlling robotic systems, particularly 
autonomous aerial vehicles such as quadcopters, because of 
its robustness to moderate model errors [26], ability to use 
high-level objectives [41], and relative simplicity. However, 
applications of MFC can be computationally demanding, and 
typically require estimating the state of the system. The state 
estimation problem can be quite challenging in complex, 
unstructured environments. Reinforcement learning can in 
principle forego the need for explicit state estimation and 
acquire a policy that directly maps sensor readings to actions 
[5]. The power of reinforcement learning is derived from 
its ability to learn directly from the real-world behavior of 
the system. Unfortunately, this strength is also its major 
weakness when applied to unstable, fragile systems such 
as aerial vehicles, which can be damaged beyond repair by 
an unsuccessful, partially trained policy (e.g. by crashing 
into an obstacle). While alternative learning methods, such 
as learning from demonstration [36], [37], can address this 
issue, they typically require costly additional information, 
such as guidance from a human expert. 

We propose to use an off-policy guided policy search 
algorithm in combination with a model predictive control 
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Fig. 1: Diagram of our method: the training phase alternates 
between running MFC to attempt the task and collect data under 
full state observation, and using this data to train a neural network 
policy that chooses actions based only on the vehicle’s onboard 
sensors. At test time, the neural network does not need the full 
state, and can control the vehicle in unstructured environments. 

(MFC) scheme to train policies for autonomous aerial ve¬ 
hicles in a way that avoids catastrophic failure at training 
time. Guided policy search transforms RL into supervised 
learning, where an optimal control algorithm provides the 
supervision and the final control policy is trained with 
supervised learning. Typically, this optimal control algorithm 
is an offline trajectory optimization procedure, which either 
assumes a known model of the dynamics [21] or uses an 
iteratively learned model [19]. Both approaches are prone 
to failure during training because the known model may 
be inaccurate while the learned model is always inaccurate 
during the early stages of learning. By substituting MFC 
for offline trajectory optimization, we can obtain a variant 
of guided policy search that is robust to moderate model 
errors, and thus avoid catastrophic failures during training. 
Furthermore, since the final policy is trained with supervised 
learning, we can train complex, high-dimensional, and highly 
nonlinear policies, such as deep neural networks, which can 
represent a wide range of complex behaviors. 

One might wonder why the guided policy search method 
is necessary if we already have access to an effective MFC 
procedure. In the case of autonomous aerial vehicles, training 
deep neural network policies with guided policy search 
affords us two main advantages. First, the neural network 
policy does not need to use the same inputs as MFC. In fact. 






















we can restrict its inputs to only those observations that are 
directly available from the vehicle’s onboard sensors, such 
as IMU readings and data from laser range finders, while the 
MPC training phase uses the full state of the system. Since 
the policy is represented by a deep neural network, it can 
even process complex, raw sensor information. For example, 
prior work has shown that guided policy search can learn 
policies that directly use camera images [23]. Since MPC is 
only used at training time, we can employ an instrumented 
training setup, where the full state is known at training time 
(e.g. using motion capture), but unavailable at test time. 
This instrumented training setup is one of the key benefits 
of our approach, since it allows for safe training with full 
state information, but still produces a policy that uses raw 
sensor readings and does not require explicit state estimation. 
The second benefit of this approach is that the final neural 
network policy is computationally much less expensive than 
MPC, and can be easily parallelized on specialized hardware. 
This advantage combines elegantly with the instrumented 
training setup, since the MPC solution can be computed 
offboard during training, while all policy computations may 
be performed onboard at test time. 

Our main contribution is an MPC-guided policy search 
algorithm that can be used for learning control policies 
for autonomous aerial vehicles. This algorithm, illustrated 
in Figure replaces the offline trajectory optimization 
that is typically used in guided policy search with online 
MPC, which continuously replans paths to the goal from 
the vehicle’s current state using an approximate model of 
the dynamics. Our modihed MPC procedure also takes into 
account the actions that would be taken at each state by 
the current neural network in order to avoid actions that the 
network is unlikely to take. This ensures that, at convergence, 
the neural network achieves good long-horizon performance, 
despite being trained only with supervised learning. Our 
approach allows us to learn neural network policies that 
directly process raw observations from the vehicle’s onboard 
sensors, and are substantially faster to evaluate at test time 
than full MPC solutions. We demonstrate our method on a 
set of simulated quadrotor control tasks, including obstacle 
avoidance using simulated laser range sensors. We show that 
our approach can learn policies that are robust to a variety 
of perturbations and generalize successfully to large obstacle 
courses, without catastrophic failure during training. 

II. Related Work 

Model predictive control (MPC) is an effective and pop¬ 
ular technique for control of robotic systems, and is fre¬ 
quently used to control autonomous aerial vehicles such 
as quadrotors [38], [35], [34], [3], [2], [4], [29]. MPC is 
straightforward to apply when the state of the system is 
known (e.g. via a motion capture system), or when it can be 
measured accurately through sensors with well-understood 
observation models. However, vehicles navigating complex, 
unstructured environments must use more complex sensors, 
such as cameras and laser range finders. Incorporating such 
sensors into optimal control directly is challenging, since the 


sensor reading depends on a complex and often unknown 
environment. This challenge is conventionally addressed by 
using localization and mapping algorithms to map out the 
environment [9] and then optimizing trajectories under the 
resulting map [11]. However, this kind of model-based 
approach is quite challenging when the vehicle is moving 
at high speed, or when onboard computation is limited. 

On the other end of the spectrum from such model-based 
methods, reinforcement learning (RL) aims to directly learn 
control policies that map observations to controls [5]. This 
approach in principle removes the need for explicit state 
estimation and extensive computation at test time, by using 
a number of training episodes to iteratively improve the 
policy from real-world experience. RL has been used to train 
robotic controllers for games such as ball-in-cup and table 
tennis [16], manipulation [31], [6], and robotic locomotion 
[17], [40], [10], [8]. An overview of recent reinforcement 
learning methods in robotics can be found in a recent survey 
paper [15]. However, model-free RL is difficult to apply to 
unstable systems such as quadrotors, due to the possibility 
of catastrophic failure during training. Model-based RL can 
mitigate this problem by training a model from real-world 
experience, and then optimizing the policy under this model. 
While such methods have been successfully applied to aerial 
vehicles [1], [28], the requirement to be able to acquire an 
accurate model means that these methods share many of the 
challenges of MPC methods. 

In this work, we use an off-policy reinforcement learning 
method called guided policy search, which incorporates the 
advantages of model-based methods at training time, while 
still training the policy to use only the onboard sensors 
of the robot, without explicit state estimation and using 
only real-world data. Guided policy search has been applied 
to locomotion [21], robotic manipulation [22], and vision- 
based robotic control [23], but all prior applications rely 
on an offline trajectory optimization phase to generate the 
controller that is then executed on the real system. While this 
offline optimization might use a learned model of the system 
dynamics [19], the resulting trajectory-centric controller is 
only adapted between episodes. This makes these methods 
liable to fail catastrophically when the model is inaccurate. 
We replace the offline trajectory optimization in guided 
policy search with MPC, which prevents catastrophic failures 
even during training, making the method suitable for learning 
policies for autonomous aerial vehicles. 

One of the key advantages of guided policy search is 
its ability to train complex, high-dimensional, and highly 
nonlinear policies [20]. This allows us to use deep neural 
network representations for our policies, thus allowing them 
to handle complex, raw input from onboard sensors, without 
extensive engineering of the policy parameterization. While 
neural networks have been used for control for decades [12], 
[33], limitations on computation and algorithms have made 
large neural network policies very difficult to learn. More 
recently, deep neural network policies have been used for 
tasks ranging from robotic control [22], [23] to video game 
playing [27]. Deep neural networks have also been used to 


learn models for MPC [18]. In this paper, we use neural 
networks to represent the policy, rather than the model, while 
MPC is used to help train this policy. 

III. Preliminaries 

The goal of policy search is to minimize the expected 
cost with respect to the parame¬ 

ters 0 of a policy 7rg(ut|ot). Here, denotes the full 
state at time t, Ot the observation, Ut the action, and 
£(x(,Ut) the cost function that dehnes the task. For ex¬ 
ample, a task that requires a quadrotor to fly to a po¬ 
sition might have the cost be the distance between the 
vehicle and the goal. The expectation is taken with re¬ 
spect top(T) =p(xi)nLiP(xt+i|xt,Ut)7re(ut|xt), where 
T = {xi, Ui,... ,xr, ut} denotes a trajectory. The observa¬ 
tion Ot is distributed according to some unknown observation 
distribution p{ot\xt), which describes how the readings from 
the robot’s sensors depend on the state. 

Optimal control also seeks to solve problems of this 
type, under various assumptions about the dynamics 
p(xt+i|xi, Ut), observation function p(ot|x(), and policy. 
For example, the differential dynamic programming (DDP) 
algorithm can be viewed as approximately optimizing the ex¬ 
pected cost under a locally linear-Gaussian dynamics model 
and with time-varying linear [13], [24] or linear-Gaussian 
[20] policies that act directly on the state (i.e. Ot = Xt). 
While this method is not as general as policy gradient 
RL algorithms, which can optimize arbitrary parameterized 
policies with arbitrary observations under unknown dynamics 
[32], it is fast, simple, and often effective [39]. In order to 
combine the efficiency of DDP with the flexibility of general 
policy search methods, guided policy search uses DDP-like 
algorithms to solve the control problem from a variety of 
initial states and generate training data for arbitrary param¬ 
eterized policies, which are then trained with supervised 
learning to mimic the behavior of the DDP solutions [20], 
[19]. 

Algorithm [T] presents a generic guided policy search 
method, where trajectory optimization is used to optimize 
a set of guiding trajectory distributions Pi{T), defined by the 
corresponding linear-Gaussian controllers pi(ut|xt), and an 
arbitrary nonlinear policy 7rg(ut|ot) is trained using samples 
from all of these controllers. Since supervised learning does 


Algorithm 1 Generic guided policy search summary 
I: for iteration fc = 1 to iT do 

2: Optimize trajectory distributions Piir) to minimize 

Epi[i{T)] and deviation from the policy 7re(ut|o4) 

3: Generate samples {r/} from each Pi(r) 

4: Train nonlinear policy tt^ (u* | Oj) to match the sampled 

trajectories {t|} 

5: Update Lagrange multipliers to encourage agreement 

between pi{ut\xt) and 7re(ut|xt) 

6 : end for 

7: return optimized policy parameters 6 


not in general guarantee that the policy 7re(ut|ot) will 
achieve good long-horizon performance [36], guided policy 
search alternates between optimizing the policy and optimiz¬ 
ing each of the trajectory distributions, each time adjusting 
the trajectory cost and the policy optimization objective to 
ensure that the linear-Gaussian controllers pi(u(|xt) and pol¬ 
icy 7rg(ut|ot) converge to the same behavior. The objective 
for trajectory optimization is modified by adding a penalty 
for the deviation from the policy, and the policy objective is 
modified by applying different weights to different samples 
[19] or using dual variables [23]. Convergence to a policy 
7re(ut|ot) that minimizes expected cost can be shown by 
casting this alternating optimization as a relaxation of a 
constrained optimization problem of the form 

min ETr^p[£(T)] 
e,p 

s.t. p{ut\xt) = 7re(ut|xt) Vf 

where £(r) is shorthand for T^e{ut\xt) is 

shorthand for J TTg{ut\ot)p{ot\xt)dot, and p{t) is a mixture 
of the guiding distributions Pi^r). This constrained prob¬ 
lem can be approximately transformed into the alternating 
optimization in Algorithm fl] by using sampling over the 
observations o* together withthe framework of dual gradient 
descent [19] or BADMM [23]. In this paper, we use the 
BADMM version, which specifies the following objective 
for trajectory optimization; 

T 

min Y] kxt, ut) - uj 

L 

-f t'tDKL(pi(ut|xt)||7re(ut|xt))j (1) 

where Xpt is a Lagrange multiplier on the mean action, 
and the third term is a KL-divergence penalty. Together, 
these terms serve to keep pi(ut|xt) close to 7rg(ut|xt). The 
supervised objective for the policy is similarly given by 

N M T 

mm X] X] X] [EtDKL (tts(uj ) ||pi (u* jx]’-’)) 

i = l j^l t = l 

+ ( 2 ) 

where N is the number of trajectory distributions Pi, M is 
the number of samples collected from each Pi(r) and T is 
the length of each trajectory r. This objective uses samples to 
estimate the integral f 7r6)(u(|o()p(ot|x()(iot and, in the case 
where the policy is given by the Gaussian 
it corresponds to a weighted least squares objective on the 
mean p^ipt), while Y7 can be solved for in closed form. For 
a detailed derivation of this method, as well as the update 
equations for vl and we refer the reader to previous 
work [23]. Note that the policy 7r6i(ut|o4) only uses the 
observations Oj as input, which means that, once it has been 
trained, it can be used in situations where the true state Xj 
is unknown. 

Prior guided policy search methods optimized the guid¬ 
ing trajectory distributions Pi{T) using either offline tra¬ 
jectory optimization with known system dynamics [21], or 
trajectory-centric reinforcement learning [19]. The former 





class of methods assumes that the true dynamics are known 
in advance, while the latter requires iteratively learning 
the dynamics by attempting to run potentially suboptimal 
controllers on the real physical system. In the case of 
unstable systems, such as autonomous aerial vehicles, neither 
approach is ideal, since the true dynamics are not known 
perfectly, and the suboptimal controller rollouts required 
for reinforcement learning might cause catastrophic failure, 
such as a crash. On the other hand, MPC methods that 
continuously recompute the vehicle’s trajectory under an 
approximate model of the dynamics have been shown to 
exhibit good robustness to model errors [39]. In the next 
section, we discuss how MPC can be combined with guided 
policy search to learn effective control policies. 

IV. MPC-Guided Policy Search 

In this paper, we use MPC together with offline trajectory 
optimization to generate guiding samples for guided policy 
search. We assume that we have access to an approximate 
model of the system dynamics, which we use during training 
to choose actions that will accomplish the desired task, start¬ 
ing from a variety of initial states. These samples are then 
used as training data to train a nonlinear policy 7re(u4|ot), 
and this policy is included in the cost function for the next 
batch of samples. By repeatedly collecting new samples and 
training the policy 7re(u(|o() in this way, the method can 
acquire an effective nonlinear policy that generalizes to new 
states. 

This method is a special case of the generic guided policy 
search framework presented in Algorithm but a number 
of modifications are necessary to adapt the approach to 
use MPC to generate the guiding trajectory distributions. 
First, the MPC procedure must minimize the objective in 
Equation Q, which means that it must also minimize de¬ 
viation from the neural network policy. Second, each MPC 
rollout produces a different locally linear controller, which 
necessitates a modification to the supervised policy learning 
phase. Lastly, since MPC uses a relatively short horizon, 
we generate target trajectories using an offline trajectory 
optimization phase, and then track these trajectories. We 
develop a formulation for this tracking objective that is 
compatible with guided policy search. 

A. Model Predictive Control with DDP 

The MPC method we use is based on differential dynamic 
programming (DDP) [13]. In particular, we use an efficient 
variant of this method called iterative LQG, which assumes 
access to an approximate model of the system dynamics and 
uses a local linear-quadratic expansion to solve the optimal 
control problem. We summarize the method in this section. 
However, our derivation largely follows prior work [41]. 

Iterative LQG assumes that the dynamics are given by 
a deterministic mean function = E[xt+i\xt,Ut], 

with additive Gaussian noise. The algorithm computes a 
linear expansion of the dynamics around a nominal tra¬ 
jectory f = {xi, Ui,..., XT, ut}, as well as a quadratic 


expansion of the cost. Without loss of generality, we as¬ 
sume that the nominal states and actions are zero for 
notational convenience. The linearized dynamics have the 
form p(xt+i|xt, Ut) = A/'(/xtXt -f /utW -f /c*, FJ, and the 
quadratic cost approximation has the form 

f(xt,ut) « i[xt;ut]'^4u,xui[xt;ut]-f[xt;ut]'^4ut-fconst, 

where subscripts denote derivatives, e.g. 4ut is the gradient 
of £ with respect to [xt;U(], while 4u.xut is the Hessian. 
Under this model of the dynamics and cost function, the 
optimal policy can be computed by recursively computing 
the quadratic Q-function and value function, starting with 
the last time step. These functions are given by 

F(xt) = 14,xtXt -I- xj 14t -I- const 
Q(xt,ut) = ^[xt;ut]'^Qxu,xu4xt;ut]-l-[xt;ut]'^Qxut-l-const 

We can express them with the following recurrence; 

Qxu.xut — 4u,xul /*xuii^,xl-t-l./xui 
Qxut — 4ui + fxut^xt+1 
Fx.xt = Qx.xt Q\l,xtQ\l,UtQu,Xt 
Fxt = Qxt ~ Qu.xtQu.utQut^ 

which allows us to compute the optimal control law as 
g{xt) =ut + kt+ Kt{xt - xt), where K* = -Q~^utQu.xt 
and k( = —Qu\itQv.t- Performing a forward rollout using 
this control law allows us to find a new nominal trajectory, 
and the backward dynamic programming pass is repeated 
around this trajectory to take the next Gauss-Newton step. 

To adapt DDP for performing MPC, we simply run the 
algorithm for a shorter horizon H at each time time step t, 
so that the backward pass is performed from time step t to 
t + H. 

B. Adapting MPC for Guided Policy Search 

While we could simply adapt the DDP-based MPC algo¬ 
rithm in the previous section to optimize Equation [T] the 
short horizon typically used in MPC makes it difficult to 
accomplish complex tasks like obstacle avoidance, which 
require long-horizon lookahead, with only a high-level spec¬ 
ification of the task, such as a desired flight direction and 
an obstacle collision penalty. Instead, we use an offline 
optimization based on iterative LQG [24] to first generate 
a reference trajectory, and then track this trajectory using 
MPC, with an additional term to account for differences from 
the neural network policy 7re(ut|ot). 

Offline, we run iterative L(^ with our known ^proximate 
model to optimize Equation H Since Equation [H contains a 
KL-divergence term, the objective can be rewritten as 

T 1 1 

min V4^(x*.uo[^^(xt,Ut) - -uf 
pat; t t 

- log7re(ut|xt) - 4^(pi(ut|xt))j, (3) 

where PL represents the maximum entropy, and this maxi¬ 
mum entropy objective can be optimized with iterative LQG. 


The solution is a linear-Gaussian controller of the form 
Pi(ut|xt) = A/'(KtXt + kt,( 5 “u 4 ), as shown in prior work 
[20]. Prior methods sample directly from the linear-Gaussian 
controller pi(ut|xt) [21], but since we would like to use 
MPC to robustly control the robot during the rollout, we 
instead construct a surrogate cost function £(xt, u*) for MPC 
that will allow us to robustly generate trajectories that have 
high probability under Pi{T). 

This surrogate cost should fulfill a number of criteria in 
order to be effective; first, it must encourage MPC to visit 
states that have high probability under Pi(r); second, it must 
produce good long-horizon behavior even when optimized 
under a short horizon; and third, it must keep the generated 
behavior close to the neural network policy 7re(ut|x(). When 
we run MPC at time step t from the current state Xj, 
we first compute pi{xt/\xt) for each t' G \t + -\- H] 

under the known approximate dynamics and the time-varying 
linear-Gaussian controller obtained from the offline LQG 
optimization. Using pt' and Ej/ to denote the mean and 
covariance of pi(xt/|xt), we can compute these distributions 
according to the following recurrence: 


pt' + l [/x£^ 
Et' + l = [/x£' 


Ut'] 

-f 

+ 

1 


/ut'] 



Q-^+Kt,Et,KT_ 

[/u£'J 


The intuition here is that we would like to figure out 
which states the offline LQG solution would prefer to visit, 
independently of the actions required to get to these states. 
This is important since, in the presence of model errors 
and perturbations, the nonlinear approximate model might 
indicate different actions when combined with MPC, but 
the overall distribution over states should remain similar. 
Once we obtain pi{xt+i ,..., xj+jj|x(), we can marginalize 
to obtain pi{xt') for each time step t' G \t + -\- H], and 

we then construct the surrogate cost as 


f(xt-,Ut/) = -logpi(xt/|xt) 

- j^tMog7re(ut/|xtO - (4) 

We then run MPC on this cost as described in the previous 
section to obtain a new linear-Gaussian controller for time 
step t of the form (ut|xt) = A/'(KtyXt -f ktij, 
and choose the action by sampling from this linear-Gaussian. 
The subscript ij here denotes the jth sample (generated via 
MPC) from the ith trajectory distribution. 

While samples generated in this way are not exactly 
samples from Pi(r), but rather samples from a distribution 
formed by the product of independent marginals at each time 
step, we found that the resulting algorithm was still able to 
produce good training data for the neural network in guided 
policy search. Furthermore, the additional information pro¬ 
vided by Pi{xt' Ixt) allowed MPC to succeed in the presence 
of model errors and disturbances, even on tasks such as 
obstacle avoidance that require long-horizon lookahead. 

One final detail is that both the cost in Equation Q and 
the offline optimization objective in Equation 0 require 
access to log7re(ut|x(), while we only have access to 
Iog7r6i(ut|o(), and Oj is in general a complex and unknown 
function of Xj, since the observation might include, e.g., laser 


rangefinder readings, while the state might consist of the 
vehicle’s position and orientation. To obtain log7r6i(ut|x(), 
we follow prior work and approximately linearize the policy 
by using the previous set of rollouts from the physical 
system. This can be done by fitting a time-varying linear- 
Gaussian model of log7r6i(ut|xt) to the samples, since each 
sample includes both xj and Oj, allowing us to evaluate the 
policy at each sampled state X(. The fitting is done by using 
linear regression with a Gaussian mixture model prior, as in 
previous work [19]. 

C. Training the Nonlinear Policy 

The final nonlinear policy 7re(ut|ot) is trained using 
standard supervised learning, from samples collected via 
MPC. The objective for this supervised learning is given in 
Equation Q, though in the case of MPC-based samples, we 
substitute pij{ut\xt) for pi(ut|xt). In the case of a con¬ 
ditionally Gaussian policy 7rg(ut|ot) = S’’'(o()), 

the KL-divergence Di^i^{TTg{ut\ol’^)\\pij)) in this ob¬ 
jective can be written out as 


T>KL(7r(,(ut|o]’Q||py(ut|x]’Q) = 

2 (p i^t) ~ 9tiji^t))Qu,utij{p (ot) — gtij{x.t)) 

- ^tr[Qu,u£ijE’"(ot)] + i log|E’"(ot)| -I- 

where gtij{xt) = T&ujXt + It-uj- Note that this is simply 
a weighted least squares objective on the mean function 
p^{ot). In this work, we represent p^{ot) with a multilayer 
neural network, which allows us to train flexible and ex¬ 
pressive policies. Since we prefer deterministic or nearly- 
deterministic policies, we choose E’’’(ot) to be constant, 
which means that we can solve for it in closed form 
according to 


^ ^ ( N I 

\ i=l j=l ) 

The neural network mean function p^(ot) is optimized using 
stochastic gradient descent (SGD). As noted earlier, one of 
the key advantages of this type of training approach is that 
the input Oj to the neural network policy need not match the 
state X( used during trajectory optimization and MPC, which 
allows us to train policies that operate directly on raw inputs 
from the onboard sensors. 


Algorithm 2 MPC-guided policy search 
1 : for iteration fc = 1 to AT do 

2: Optimize Pi(T) offline according to Equation 0 

3: Run MPC M times from initial states Xi ^ Pi(xi)to 

create {pij('r)} and {rij} using 1{t) in Equation 0 
4: Train nonlinear policy 7re(ut|o4) to match each 

Pij{vit\x.t) along each Tij, using Equation 0 
5: Eit time-varying linear-Gaussian model to estimate 

7rg(ut|x() around each Pi{T) using samples {rij} 

6: Update z/j and Aj^^ as in [23] 

7: end for 

8: return optimized policy parameters 6 













D. Algorithm Summary 

A summary of our method is presented in Algorithm 
At each iteration, we first generate an offline solution by 
using iterative LQG to optimize the augmented objective in 
Equation ([^. This offline solution allows us to initialize and 
construct the cost for MFC rollouts. We conduct M MFC 
rollouts for each trajectory distribution Pi{T), constructing a 
new surrogate cost £(xt,Ut) at each time step. These MFC 
rollouts provide us with sample trajectories {rij} and MFC 
controllers {Py (r)}, which we can use to train the nonlinear 
neural network policy 7r6i(ut|o() as described in the previous 
section. After the policy is trained, we update our time- 
varying linear-Gaussian fit for 7re(ut|xt) by using the latest 
samples. Note that a separate linear-Gaussian estimate of 
7re(ut|xt) is constructed around each trajectory distribution 
Piir). Finally, we adjust the dual variables as described in 
previous work [23]. 

V. Experimental Evaluation 

We evaluated our method on simulated quadrotor obstacle 
avoidance tasks. 

A. Quadrotor System 

The simulated quadrotor was modeled after 3DR’s 
IRlS-tQ which has width 0.47m, height 0.11m, and weight 
1.5kg. The dynamics followed the formulation described 
by Martin and Salaun [25]. The true state of the ve¬ 
hicle Xt = (p(, Vt, qt, ott) S consisted of the posi¬ 
tion pt = zt) and orientation qt, expressed as a 

quaternion, as well as their time derivatives, i.e. lin¬ 
ear velocity Vt and angular velocity u)f The controls 
Ut G R"^ consisted of rotor velocities. The observation model 
Ot = (rt,Vt, qt, ujt) G R^° lacked the position pt and in¬ 
stead included readings rj from a set of 30 equally spaced 
laser rangefinders with max range 5m arranged in 180 degree 
fan in front of the vehicle. This type of observation model is 
quite challenging to integrate into simple control methods, 
such as time-varying linear controllers, but can be easily 
processed by a multilayer neural network policy. 

B. Cost Function 

The cost function for the iterative offline LQG optimiza¬ 
tion was 

Z(xt, Ut) =10^11 Vt - V* 111 -I- 500||j2t - z*\\%+ 

10 ||qt —q 1 12 + 250||a;t — u) II 2 + 

0.5||ut UHOVERII 2 + 

10 "^ max(dsAFE — signed_distance(xt), 0) 

where x* = Zt),Vt,qt,uJt) is the full state as 

defined previously; z*, v*, q*, a;* are the target height, linear 
velocity, orientation, and angular velocity, respectively; and 
'^HovER Ihe desired rotor velocity for hovering. The final 
term is a hinge loss on the distance of the quadrotor to the 
nearest obstacle; there is no penalty if the nearest obstacle 
is further than dsAPE- 

'https://3dr.com/kb/iris/ 



Fig. 2: The quadrotor must learn to fly around a cylindrical 
obstacle and down a hallway using only onboard sensing. The blue 
semicircle is the range of the onboard laser range finders. 


C. Neural Network Policy 

The neural network policy consisted of two fully con¬ 
nected hidden layers, each with 40 rectified linear (ReLU) 
units [30]. For training the neural network in each iteration 
of guided policy search, we used the ADAM [14] algorithm, 
optimizing using 20,000 minibatches with a minibatch size 
of 50. We used default values for all learning parameters as 
presented in [14]. 

D. Experimental Domains 

Figure 1^ depicts the two simulated environments in which 
we trained our neural network policies: a single cylindrical 
obstacle of radius 0.5m and height 4m, and a straight hallway 
of width 5m and height 4m. For the cylinder avoidance task, 
we used iV = 18 initial states with varying initial (cc, y) 
positions in front of the cylinder; for the hallway task, we 
used N = 6 initial states with varying y-coordinate values. 
For both tasks, each initial state corresponds to a different 
trajectory distribution Pi{T), with M = 4 samples from each 
distribution. Each trajectory t has length T = 150, which 
equates to 7.5 seconds. Collision with an obstacle or the 
ground, or flying above an obstacle were considered as a 
training crash. 

E. Baseline Methods and Model Errors 

To evaluate the importance of using MFC, we trained 
neural network policies on each of the tasks in the presence 
of model errors, using three variants: the full MFC-guided 
policy search algorithm with the surrogate cost 
described in Section [IV-B| , a variant of MFC-guided policy 
search that uses the true cost £(Xi,Ut) with the policy KL- 
divergence term and dual variables for MFC, and a variant 
that does not use MFC at all, and instead performs the 
rollouts by using the time-varying linear-Gaussian policy 
generated by the offline iterative LQG algorithm. All meth¬ 
ods were trained for K = 5 guided policy search iterations. 

We evaluated each of the above variants in the presence 
of four different types of model errors: no model error, the 
actual weight of the quadrotor was 0.05kg (3.3%) greater 
than the expected weight, the two rotors on one side of 
the quadrotor had an 8% multiplicative rotor velocity bias, 
and all model parameters (e.g. moments of inertia, drag 
coefficients) were perturbed by retaining only one significant 
digit. 











train: single cylinder; test: infinite forest 


no model error 

0.05kg mass error 

8% rotor bias 

perturbed model parameters 

method 

(baseline) 

offline 

^(xt,ut) 

(baseline) 

MPC 

f(xt, Ut) 

full 

MPC 

f(xt, Ut) 

(baseline) 

offline 

^(xt,ut) 

(baseline) 

MPC 

f(xt, Ut) 

full 

MPC 
f(xt, Ut) 

(baseline) 

offline 

^(xt,Ut) 

(baseline) 
MPC 
f(xt, Ut) 

full 

MPC 
f(xt, Ut) 

(baseline) 

offline 

^(xt,Ut) 

(baseline) 
MPC 
f(xt, Ut) 

fuU 

MPC 
f(xt, Ut) 

number of 
training crashes 

1 

0 

0 

1 

0 

0 

46 

0 

0 

N/A 

4 

0 

average test 
flight duration (s) 

56.9 

± 27.3 

35.8 
± 22.5 

53.4 
± 22.5 

11.3 
± 5.2 

6.9 
± 2.6 

34.4 

± 8.4 

60.0 
± 15.8 

56.1 
± 30.8 

95.7 

± 9.8 

N/A 

20.8 
± 9.8 

60.5 

± 27.0 



train: straight hallway; test: winding hallway 


no model error 

0.05kg mass error 

8% rotor bias 

perturbed model parameters 

method 

(baseline) 
offline 
^(Xf, Ut) 

(baseline) 

MPC 

f(xt, Uf) 

full 

MPC 

f(xt,ut) 

(baseline) 

offline 

^(xt,ut) 

(baseline) 

MPC 

f(xt, Ut) 

full 

MPC 

^(xt, Ut) 

(baseline) 

offline 

^(Xf,Ut) 

(baseline) 

MPC 

f(xt, Ut) 

full 

MPC 

£(xt, Ut) 

(baseline) 

offline 

^(Xf,Ut) 

(baseline) 

MPC 

f(xt, Ut) 

full 

MPC 

^(xt, Ut) 

number of 
training crashes 

0 

0 

0 

0 

0 

0 

76 

0 

0 

N/A 

0 

0 

average test 
flight duration (s) 

46.2 

± 28.4 

35.2 
± 13.3 

35.2 
± 20.0 

21.7 
± 5.8 

15.7 
± 1.3 

31.8 

± 15.7 

26.0 
± 21.1 

51.1 

± 28.6 

28.5 
± 16.2 

N/A 

9.9 
± 4.2 

55.2 
± 17.5 


TABLE I: Training and test results comparing our full MPC-guided policy search method with two baselines. The test flight duration 
was averaged over 20 runs of the learned policies. In one scenario (top table), the policy search method variants were trained on a single 
cylinder and tested in an infinite forest. In the other scenario (bottom table), the policy search variants were trained in a straight hallway 
and tested in a winding hallway. Each policy search variant was tested with four different model errors. Experiments labelled N/A were 
unable to complete due to excessive crashing. In the majority of experiments, the full MPC-guided policy search method outperformed 
the two baselines, crashing less during training and with the final learned neural network policy flying for the longest duration in the test 
scenarios. 


F. Results 

Table shows the number of crashes experienced by the 
quadrotor during training. These results indicate that MPC 
using the surrogate cost is able to train a successful neural 
network policy without experiencing catastrophic failure. 

To evaluate the generalization of the learned policies, we 
ran the trained neural networks in two test scenarios: an 
infinite forest of cylinders of the same shape as in training, 
but at random positions an average distance of 5m apart, 
and a winding hallway with randomized turns of at most 
30° every 5m. We ran the policies trained with a single 
cylinder in the infinite forest, and we ran the trained policies 
for the straight hallway in the winding hallway. The average 
flight duration of the final trained neural network policies 
are shown in Table |I] With no model etTors, our MPC- 
guided policy search algorithm was comparable to the other 
methods. When model errors were introduced, our method 
outperformed the two baselines in the majority of scenarios. 
Videos of the resulting flights are included as supplementary 
material, and may also be viewed on the project webpag^ 

Our evaluation shows that MPC-guided policy search is an 
effective algorithm for off-policy training of complex neural 
network policies for autonomous aerial vehicles. Our full 
method was able to learn each of the two behaviors without 
experiencing any catastrophic failures during training, and 
the trained policy was able to generalize effectively. 

VI. Discussion AND Future Work 

We presented an algorithm for training deep neural net¬ 
work control policies for autonomous aerial vehicles, by 
using model predictive control to generate guiding samples 
for guided policy search. Our MPC-guided policy search uses 

^http://rll.berkeley.edu/icra2016mpcgps 


a modified MPC algorithm that trades off minimizing the cost 
against matching the current neural network policy, so as to 
generate good training data that can be used to train a better 
policy with standard supervised learning. Since the partially 
trained neural network policy is never used to choose actions 
at training time, the more robust and reliable MPC method 
provides a substantial improvement in safety over traditional 
reinforcement learning methods. Our results show that this 
algorithm is able to learn complex policies, such as high 
speed obstacle avoidance, using raw sensor inputs and low- 
level rotor command outputs. 

One of the key ideas behind our method is the notion 
of an instrumented training setup, which allows MPC to be 
performed at training time with full state observations, which 
could be provided, for example, by using motion capture. At 
the same time, the vehicle gathers observations from its own 
onboard sensors, and trains the policy to mimic the action 
chosen by MPC using only the raw sensor readings, without 
relying on the full state. Acquiring the sensor readings is 
important, because accurately modeling complex sensors, 
such as laser range finders and cameras, is very difficult, 
while obtaining a model of the vehicle that is accurate 
enough to perform MPC is comparatively easier. 

While our approach can train very complex, high¬ 
dimensional policies, it shares many of the limitations of 
prior guided policy search methods [23]. In particular, full 
state observations are required at training time, in order 
to perform MPC, even though the final neural network 
policy can perform the task using only onboard sensors. 
In the real world, this kind of state information could be 
obtained using an instrumented training environment (with, 
for example, motion capture). Since the instrumentation is 
only required during training, the final neural network is still 
able to act in the real world, so this approach is practical 










































for a wide range of robotic tasks. However, not all aerial 
maneuvers can be learned in such an instrumented training 
setup, and combining explicit state estimation with guided 
policy search in future work could lead to a much more 
broadly applicable algorithm. Another direction that can be 
explored in future work is to combine guided policy search 
with more sophisticated MPC and planning algorithms. In 
principle, a wide variety of methods can be used to generate 
guiding samples, and more sophisticated methods might 
afford superior robustness and obstacle avoidance [7]. 
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